One-sample t-test (Student’s t-test)#
The one-sample t-test answers:
Is the population mean plausibly equal to a reference value μ0?
Use it when you have one sample of numeric measurements, the population standard deviation is unknown, and you care about the mean.
Learning goals#
Know when the one-sample t-test is the right tool (and when it is not).
Write the null and alternative hypotheses (two-sided / one-sided).
Understand the t-statistic as a signal-to-noise ratio.
Interpret the p-value correctly (what it is and what it is not).
Implement the test with NumPy only (including an estimated p-value).
Build intuition with Plotly visuals (p-value area, df effects, sampling behavior, power).
Prerequisites#
Basic descriptive stats: mean, (sample) standard deviation.
The idea of a sampling distribution (statistics vary from sample to sample).
Optional: the z-test (mean test when σ is known).
import math
import platform
import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)
print("Python", platform.python_version())
print("NumPy ", np.__version__)
try:
import plotly
print("Plotly", plotly.__version__)
except Exception:
pass
try:
import scipy
print("SciPy ", scipy.__version__)
except Exception:
pass
Python 3.12.9
NumPy 1.26.2
Plotly 6.5.2
SciPy 1.15.0
Intuition: signal-to-noise#
You observe a sample x1, x2, …, xn and want to compare its mean x̄ to a reference mean μ0.
The core question is not “is x̄ different from μ0?” (it almost always is, a little), but:
Is the difference large relative to the uncertainty in the mean estimate?
The t-statistic is exactly that ratio:
t = (x̄ − μ0) / SE
SE = s / √n
df = n − 1
numerator: signal (x̄ − μ0)
denominator: noise (estimated standard error of the mean)
Because we estimate the standard deviation with s (instead of knowing σ exactly), the statistic follows a Student t distribution under the null hypothesis.
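The ratio takes only a few lines of NumPy. A minimal sketch (the sample values here are made up for illustration, not the notebook's data):

```python
import numpy as np

# Hypothetical mini-sample (invented values for illustration)
x = np.array([251.2, 249.8, 252.5, 250.1, 251.9])
mu0 = 250.0

n = x.size
mean = x.mean()
s = x.std(ddof=1)          # sample std (n - 1 in the denominator)
se = s / np.sqrt(n)        # estimated standard error of the mean
t = (mean - mu0) / se      # signal-to-noise ratio
df = n - 1

print(f"t = {t:.3f} with df = {df}")
```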
When is it used?#
Typical use cases:
Quality control: Is the average fill volume equal to the labeled amount?
Performance/SLA: Is the mean latency higher than a target threshold?
Science/medicine: Is the mean change from baseline different from 0?
You are comparing a single sample mean to a fixed reference value.
If you are comparing two groups (two samples), you want a two-sample t-test (or Welch’s t-test).
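For contrast, here is a hedged sketch of Welch's two-sample test with SciPy (the two groups are invented for illustration; `equal_var=False` selects Welch's variant, which does not assume equal variances):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two illustrative groups with unequal variances
a = rng.normal(loc=250.0, scale=3.0, size=15)
b = rng.normal(loc=253.0, scale=6.0, size=25)

# Welch's t-test: equal_var=False drops the equal-variance assumption
res = stats.ttest_ind(a, b, equal_var=False)
print(f"Welch t={res.statistic:.3f}, p={res.pvalue:.4f}")
```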
Assumptions (and what happens if they fail)#
The classical one-sample t-test relies on:
Independence: observations are independent (no time dependence, no clustering).
Approximately normal data (or large n):
If the population is normal, the test is exact.
For larger n, the Central Limit Theorem makes the sampling distribution of the mean closer to normal.
No extreme outliers: outliers inflate s and can dominate the mean.
If normality is questionable and n is small, consider robust / nonparametric alternatives (e.g., sign test, Wilcoxon signed-rank test) or bootstrap confidence intervals.
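One NumPy-only fallback is a percentile bootstrap CI for the mean. A sketch on an invented right-skewed sample (not part of this notebook's main analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented right-skewed sample where normality is doubtful
x = rng.exponential(scale=2.0, size=15)

B = 10_000
# Resample with replacement; recompute the mean for each resample
boot_means = rng.choice(x, size=(B, x.size), replace=True).mean(axis=1)
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {x.mean():.3f}, 95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```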
Hypotheses#
Choose an alternative that matches the question before looking at the data:
Two-sided: H0: μ = μ0 vs H1: μ ≠ μ0
Greater (right-tailed): H0: μ = μ0 vs H1: μ > μ0
Less (left-tailed): H0: μ = μ0 vs H1: μ < μ0
Test procedure (recipe)#
Choose μ0, a significance level α (often 0.05), and the alternative.
Compute x̄, s, and SE = s/√n.
Compute t = (x̄ − μ0)/SE with df = n − 1.
Compute the p-value as a tail probability under T ~ t(df).
Reject H0 if p ≤ α.
Report the estimate (mean), a CI for μ, and an effect size.
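The recipe can be sketched end-to-end. This version uses SciPy's exact t distribution functions (`stats.t.sf`, `stats.t.ppf`) rather than the NumPy-only Monte Carlo approach developed later in this notebook; the sample is synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=251.0, scale=4.0, size=20)    # synthetic sample
mu0, alpha = 250.0, 0.05

# Summary statistics and the t-statistic
n = x.size
df = n - 1
mean, s = x.mean(), x.std(ddof=1)
se = s / np.sqrt(n)
t = (mean - mu0) / se

# Exact two-sided p-value from the t survival function
p = 2 * stats.t.sf(abs(t), df)

# CI for mu and Cohen's d (one-sample)
tcrit = stats.t.ppf(1 - alpha / 2, df)
ci = (mean - tcrit * se, mean + tcrit * se)
d = (mean - mu0) / s

print(f"t={t:.3f} (df={df}), p={p:.4f}")
print(f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}], Cohen's d={d:.3f}")
```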
Interpreting the result (what it means)#
The p-value is:
The probability, assuming the null hypothesis is true, of seeing a t-statistic at least as extreme as the one you observed.
So:
Small p-value → your data would be rare under H0 → evidence against μ = μ0.
Large p-value → your data is not unusual under H0 → you fail to reject H0.
Important: a large p-value does not prove μ = μ0. It usually means “not enough evidence with this sample size / noise level”.
A helpful companion is the confidence interval (CI) for μ:
A 95% two-sided CI is the set of means that would not be rejected by a 5% two-sided test.
If μ0 is outside the CI, you reject at that α.
Also separate statistical significance (p-value) from practical significance (effect size like Cohen’s d).
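The CI/test duality can be checked numerically. A sketch with SciPy on a synthetic sample (the three μ0 values are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=252.0, scale=4.0, size=20)    # synthetic sample
alpha = 0.05

n, df = x.size, x.size - 1
mean = x.mean()
se = x.std(ddof=1) / np.sqrt(n)
tcrit = stats.t.ppf(1 - alpha / 2, df)
ci = (mean - tcrit * se, mean + tcrit * se)

# Duality: mu0 is rejected at level alpha exactly when it lies outside the CI
results = []
for mu0 in (248.0, 250.0, 252.0):
    p = stats.ttest_1samp(x, mu0).pvalue
    inside = ci[0] <= mu0 <= ci[1]
    results.append((mu0, p, inside))
    print(f"mu0={mu0}: p={p:.4f}, inside CI={inside}")
```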
# Example data: fill volumes (ml). The label says 250ml.
mu0 = 250.0
# Synthetic sample: true mean slightly above 250, unknown variance
x = rng.normal(loc=252.0, scale=4.0, size=20)
print("n=", x.size)
print("mean=", float(x.mean()))
print("std=", float(x.std(ddof=1)))
x[:10]
n= 20
mean= 250.7343483866156
std= 3.124737477155004
array([252.0049, 253.195 , 250.9034, 248.4376, 250.1813, 248.0334,
252.2406, 257.3609, 250.0312, 249.5181])
# Always visualize the sample (outliers + skewness matter)
fig = go.Figure()
fig.add_trace(
go.Violin(
y=x,
box_visible=True,
meanline_visible=True,
points="all",
jitter=0.2,
name="sample",
)
)
fig.add_shape(
type="line",
x0=0,
x1=1,
xref="paper",
y0=mu0,
y1=mu0,
line=dict(color="rgba(214, 39, 40, 1)", dash="dash", width=2),
)
fig.add_annotation(
x=0.98,
y=mu0,
xref="paper",
text="μ0",
showarrow=False,
yshift=10,
font=dict(color="rgba(214, 39, 40, 1)"),
)
fig.update_layout(title="Sample vs reference μ0", yaxis_title="measurement (ml)")
fig.show()
NumPy-only implementation (from scratch)#
We can compute the t-statistic exactly with NumPy.
The only tricky part (if you restrict yourself to NumPy) is computing the t-distribution tail probability (the p-value) and the t critical value for the CI.
To keep everything NumPy-only and still match the definition (“tail area under the null distribution”), we estimate these probabilities via Monte Carlo:
Draw many samples from T ~ t(df) using np.random.standard_t.
Approximate probabilities as empirical frequencies.
This is not how you’d do production statistics (you’d use a specialized library for accurate CDF/PPF), but it’s a great way to understand what the p-value is.
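A quick sanity check of the Monte Carlo idea against SciPy's exact tail probability (the df and t value are chosen arbitrarily for illustration):

```python
import numpy as np
from scipy import stats

df, t_obs = 19, 2.0                 # arbitrary illustrative values
rng = np.random.default_rng(0)

# Monte Carlo estimate of P(|T| >= t_obs) under T ~ t(df)
t_null = rng.standard_t(df, size=500_000)
p_mc = float(np.mean(np.abs(t_null) >= t_obs))

# Exact value from the t survival function
p_exact = 2 * stats.t.sf(t_obs, df)
print(f"Monte Carlo p ≈ {p_mc:.4f}, exact p = {p_exact:.4f}")
```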
def student_t_pdf(x: np.ndarray, df: int) -> np.ndarray:
"""Student t PDF computed from the definition (NumPy + standard library only)."""
x = np.asarray(x, dtype=float)
df = int(df)
if df <= 0:
raise ValueError("df must be a positive integer")
log_norm = math.lgamma((df + 1) / 2) - (
0.5 * math.log(df * math.pi) + math.lgamma(df / 2)
)
return np.exp(log_norm) * (1 + (x**2) / df) ** (-(df + 1) / 2)
def normal_pdf(x: np.ndarray) -> np.ndarray:
x = np.asarray(x, dtype=float)
return (1 / np.sqrt(2 * np.pi)) * np.exp(-0.5 * x**2)
def ttest_1samp_numpy(
x: np.ndarray,
mu0: float,
*,
alternative: str = "two-sided",
alpha: float = 0.05,
n_mc: int = 300_000,
seed: int = 123,
) -> dict:
"""One-sample t-test with a NumPy-only Monte Carlo p-value.
Parameters
- x: sample (1D array)
- mu0: null mean
- alternative: 'two-sided', 'greater', 'less'
- alpha: significance level for CI and decision
- n_mc: Monte Carlo sample size for approximating p-value and t critical values
"""
x = np.asarray(x, dtype=float)
x = x[~np.isnan(x)]
n = int(x.size)
if n < 2:
raise ValueError("Need at least 2 non-NaN observations.")
df = n - 1
mean = float(x.mean())
s = float(x.std(ddof=1))
se = s / np.sqrt(n) if s > 0 else 0.0
if se == 0.0:
t_stat = float(np.inf * np.sign(mean - mu0) if mean != mu0 else 0.0)
p_value = float(0.0 if mean != mu0 else 1.0)
t_crit = float("nan")
ci = (mean, mean)
cohen_d = float(np.inf * np.sign(mean - mu0) if mean != mu0 else 0.0)
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
return {
"n": n,
"df": df,
"mu0": float(mu0),
"mean": mean,
"std": s,
"se": se,
"t_stat": t_stat,
"p_value": p_value,
"alpha": float(alpha),
"alternative": alternative,
"t_crit": t_crit,
"ci": (float(ci[0]), float(ci[1])),
"cohen_d": cohen_d,
"decision": decision,
"mc_samples": int(n_mc),
"mc_seed": int(seed),
}
t_stat = float((mean - mu0) / se)
cohen_d = float((mean - mu0) / s)
rng_local = np.random.default_rng(seed)
t_null = rng_local.standard_t(df, size=int(n_mc))
if alternative == "two-sided":
p_value = float(np.mean(np.abs(t_null) >= abs(t_stat)))
t_crit = float(np.quantile(t_null, 1 - alpha / 2))
ci = (mean - t_crit * se, mean + t_crit * se)
elif alternative == "greater":
p_value = float(np.mean(t_null >= t_stat))
t_crit = float(np.quantile(t_null, 1 - alpha))
ci = (mean - t_crit * se, np.inf)
elif alternative == "less":
p_value = float(np.mean(t_null <= t_stat))
t_crit = float(np.quantile(t_null, 1 - alpha))
ci = (-np.inf, mean + t_crit * se)
else:
raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
return {
"n": n,
"df": df,
"mu0": float(mu0),
"mean": mean,
"std": s,
"se": float(se),
"t_stat": t_stat,
"p_value": p_value,
"alpha": float(alpha),
"alternative": alternative,
"t_crit": t_crit,
"ci": (float(ci[0]), float(ci[1])),
"cohen_d": cohen_d,
"decision": decision,
"mc_samples": int(n_mc),
"mc_seed": int(seed),
}
res = ttest_1samp_numpy(x, mu0, alternative="two-sided", alpha=0.05, n_mc=500_000, seed=42)
print(f"n={res['n']}, mean={res['mean']:.3f}, std={res['std']:.3f}, SE={res['se']:.3f}")
print(f"t={res['t_stat']:.3f} (df={res['df']}), p≈{res['p_value']:.4f}, alpha={res['alpha']}")
print(f"95% CI for μ: [{res['ci'][0]:.3f}, {res['ci'][1]:.3f}]")
print(f"Cohen's d (one-sample): {res['cohen_d']:.3f}")
print("Decision:", res["decision"])
n=20, mean=250.734, std=3.125, SE=0.699
t=1.051 (df=19), p≈0.3066, alpha=0.05
95% CI for μ: [249.270, 252.199]
Cohen's d (one-sample): 0.235
Decision: fail to reject H0
# Optional validation against SciPy (production-grade distribution functions)
try:
from scipy.stats import ttest_1samp
scipy_res = ttest_1samp(x, popmean=mu0, alternative="two-sided")
print("SciPy t:", float(scipy_res.statistic))
print("SciPy p:", float(scipy_res.pvalue))
except Exception as e:
print("SciPy check skipped:", e)
SciPy t: 1.0510021553137605
SciPy p: 0.30644215697136534
Visual: the p-value is a tail area#
For a two-sided test, the p-value is the probability (under H0) of seeing |T| ≥ |t_obs|.
That probability corresponds to the red shaded area below.
t_obs = res["t_stat"]
df = res["df"]
tcrit = res["t_crit"]
xmax = max(6.0, abs(t_obs) + 1.0, abs(tcrit) + 1.0)
xx = np.linspace(-xmax, xmax, 3001)
yy = student_t_pdf(xx, df)
abs_t = abs(t_obs)
mask_left = xx <= -abs_t
mask_right = xx >= abs_t
fig = go.Figure()
fig.add_trace(go.Scatter(x=xx, y=yy, mode="lines", name=f"t pdf (df={df})"))
fig.add_trace(
go.Scatter(
x=xx[mask_left],
y=yy[mask_left],
mode="lines",
line=dict(width=0),
fill="tozeroy",
name="p-value tail",
showlegend=False,
fillcolor="rgba(214, 39, 40, 0.35)",
)
)
fig.add_trace(
go.Scatter(
x=xx[mask_right],
y=yy[mask_right],
mode="lines",
line=dict(width=0),
fill="tozeroy",
showlegend=False,
fillcolor="rgba(214, 39, 40, 0.35)",
)
)
ymax = float(yy.max())
for xline, dash, color, width in [
(t_obs, "dash", "rgba(214, 39, 40, 1)", 2),
(tcrit, "dot", "rgba(0, 0, 0, 0.6)", 1),
(-tcrit, "dot", "rgba(0, 0, 0, 0.6)", 1),
]:
fig.add_shape(
type="line",
x0=xline,
x1=xline,
y0=0,
y1=ymax,
xref="x",
yref="y",
line=dict(color=color, dash=dash, width=width),
)
fig.update_layout(
title=f"Two-sided p-value as tail area (t={t_obs:.3f}, p≈{res['p_value']:.4f})",
xaxis_title="t",
yaxis_title="density",
)
fig.show()
Visual: degrees of freedom control tail heaviness#
For small df, the t distribution has heavier tails than the standard normal.
As df → ∞, the t distribution approaches a normal distribution.
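This convergence can be checked numerically with SciPy's pdf functions (the df grid below is arbitrary):

```python
import numpy as np
from scipy import stats

xx = np.linspace(-5, 5, 1001)
norm_pdf = stats.norm.pdf(xx)

# Largest pointwise gap between the t pdf and the standard normal pdf
gaps = []
for df in (1, 5, 30, 200):
    gap = float(np.max(np.abs(stats.t.pdf(xx, df) - norm_pdf)))
    gaps.append(gap)
    print(f"df={df}: max |t pdf - normal pdf| = {gap:.5f}")
```

The gap shrinks monotonically as df grows, matching the slider figure above.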
dfs = [1, 2, 5, 10, 30, 100]
xx = np.linspace(-5, 5, 2001)
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=xx,
y=normal_pdf(xx),
mode="lines",
name="Normal(0,1)",
line=dict(color="black", dash="dash"),
)
)
for i, df_ in enumerate(dfs):
fig.add_trace(
go.Scatter(
x=xx,
y=student_t_pdf(xx, df_),
mode="lines",
name=f"t(df={df_})",
visible=(i == 0),
)
)
steps = []
for i, df_ in enumerate(dfs):
visible = [True] + [False] * len(dfs)
visible[1 + i] = True
steps.append(
dict(
method="update",
args=[{"visible": visible}, {"title": f"Student t vs Normal — df={df_}"}],
label=str(df_),
)
)
fig.update_layout(
title=f"Student t vs Normal — df={dfs[0]}",
xaxis_title="x",
yaxis_title="density",
sliders=[
dict(
active=0,
currentvalue={"prefix": "df: "},
pad={"t": 30},
steps=steps,
)
],
)
fig.show()
Visual: under H0, the computed t-statistic follows a t distribution#
If the data really comes from a normal population with mean μ0, then the statistic
t = (x̄ − μ0) / (s / √n)
has a t(df=n−1) distribution.
We can check that by simulation.
rng_sim = np.random.default_rng(123)
B = 50_000
n = res["n"]
df = n - 1
mu0_sim = 0.0
x0 = rng_sim.normal(loc=mu0_sim, scale=1.0, size=(B, n))
t_stats = (x0.mean(axis=1) - mu0_sim) / (x0.std(axis=1, ddof=1) / np.sqrt(n))
xx = np.linspace(-6, 6, 2001)
yy = student_t_pdf(xx, df)
fig = go.Figure()
fig.add_trace(
go.Histogram(
x=t_stats,
nbinsx=80,
histnorm="probability density",
name="Simulated t-stat",
opacity=0.65,
)
)
fig.add_trace(
go.Scatter(x=xx, y=yy, mode="lines", name=f"t pdf (df={df})", line=dict(width=3))
)
fig.update_layout(
barmode="overlay",
title=f"t-statistic under H0 matches t(df={df})",
xaxis_title="t",
yaxis_title="density",
)
fig.show()
Visual: power depends on effect size and sample size#
Power is the probability of rejecting H0 when H1 is true.
If the true mean is μ = μ0 + δ, power increases when:
|δ| is larger (bigger effect)
n is larger (smaller SE)
α is larger (more aggressive rejection rule)
Below we estimate power by simulation for a two-sided test at α=0.05.
alpha = 0.05
# Generic setup for power: mu0=0, sigma=1 so δ is an effect size in "sigma units"
mu0_power = 0.0
sigma_power = 1.0
deltas = [0.2, 0.5, 0.8]
n_grid = np.array([5, 8, 12, 20, 30, 40, 60, 80, 100])
B_power = 12_000
B_crit = 120_000
rng_power = np.random.default_rng(202)
power = {d: [] for d in deltas}
for n in n_grid:
df = n - 1
t_null = rng_power.standard_t(df, size=B_crit)
tcrit = float(np.quantile(t_null, 1 - alpha / 2))
for d in deltas:
x_alt = rng_power.normal(
loc=mu0_power + d * sigma_power,
scale=sigma_power,
size=(B_power, n),
)
t_alt = (x_alt.mean(axis=1) - mu0_power) / (
x_alt.std(axis=1, ddof=1) / np.sqrt(n)
)
power[d].append(float(np.mean(np.abs(t_alt) >= tcrit)))
fig = go.Figure()
for d in deltas:
fig.add_trace(
go.Scatter(
x=n_grid,
y=power[d],
mode="lines+markers",
name=f"δ={d}σ",
)
)
fig.update_layout(
title="Estimated power vs sample size (two-sided t-test, α=0.05)",
xaxis_title="n",
yaxis_title="power",
yaxis=dict(range=[0, 1]),
)
fig.show()
Pitfalls and practical notes#
The p-value is not “the probability that H0 is true”. It is P(data at least as extreme | H0).
Failing to reject does not prove equality; it often means low power.
Outliers can strongly affect the mean and inflate s. Always plot the data.
Independence is crucial. If you have time-series dependence or repeated measures, use the right model/test.
If you test many hypotheses, consider multiple testing corrections.
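As a sketch of the simplest such correction, Bonferroni compares each p-value to α divided by the number of tests (the p-values below are invented):

```python
import numpy as np

# Invented p-values from m hypothetical tests
p_values = np.array([0.003, 0.012, 0.041, 0.20, 0.65])
alpha = 0.05
m = p_values.size

# Bonferroni: test at alpha/m to control the family-wise error rate
threshold = alpha / m
reject = p_values <= threshold
print(f"per-test threshold: {threshold}")
print("rejected:", reject)
```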
Exercises#
Change alternative to 'greater' and 'less'. How does the CI change?
Make n smaller (e.g., 5) and rerun. What happens to the t critical value and p-value stability?
Add a single extreme outlier to x and rerun. What happens to the mean, s, and the decision?
Compare the Monte Carlo p-value to SciPy while increasing n_mc. How fast does it converge?
References#
Standard introductory statistics texts: Student’s one-sample t-test
SciPy: scipy.stats.ttest_1samp